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Abstract 

Background: Genomic data are used in animal breeding to assist genetic evaluation. Several models to estimate 
genomic breeding values have been studied. In general, two approaches have been used. One approach estimates 
the marker effects first and then, genomic breeding values are obtained by summing marker effects. In the second 
approach, genomic breeding values are estimated directly using an equivalent model with a genomic relationship 
matrix. Allele coding is the method chosen to assign values to the regression coefficients in the statistical model. A 
common allele coding is zero for the homozygous genotype of the first allele, one for the heterozygote, and two 
for the homozygous genotype for the other allele. Another common allele coding changes these regression 
coefficients by subtracting a value from each marker such that the mean of regression coefficients is zero within 
each marker. We call this centered allele coding. This study considered effects of different allele coding methods 
on inference. Both marker-based and equivalent models were considered, and restricted maximum likelihood and 
Bayesian methods were used in inference. 

Results: Theoretical derivations showed that parameter estimates and estimated marker effects in marker-based 
models are the same irrespective of the allele coding, provided that the model has a fixed general mean. For the 
equivalent models, the same results hold, even though different allele coding methods lead to different genomic 
relationship matrices. Calculated genomic breeding values are independent of allele coding when the estimate of 
the general mean is included into the values. Reliabilities of estimated genomic breeding values calculated using 
elements of the inverse of the coefficient matrix depend on the allele coding because different allele coding 
methods imply different models. Finally, allele coding affects the mixing of Markov chain Monte Carlo algorithms, 
with the centered coding being the best. 

Conclusions: Different allele coding methods lead to the same inference in the marker-based and equivalent 
models when a fixed general mean is included in the model. However, reliabilities of genomic breeding values are 
affected by the allele coding method used. The centered coding has some numerical advantages when Markov 
chain Monte Carlo methods are used. 



Background different. Thus, this allele coding method does not give 

There has been growing interest in the use of marker- unique regression coefficients. 

based models [1] in recent years. In studies using these There are other allele coding methods such as the one 
models, descriptions of the effect of allele coding system that use coefficients -1,0, and 1 instead of 0, 1, and 2, 
on inference and computations are often vague or missing. respectively. Different allele coding methods affect coeffi- 
By allele coding, we refer to the coefficients in the marker cients in the statistical models but they do not seem to 
matrix of marker-based models. Coefficients, commonly change the amount of information for statistical inference, 
used for allele coding of a marker is 0 when the individual Hence, one would expect that the use of different allele 
is homozygous for the first allele, 1 when the individual is coding methods would lead to the same inference. How- 
heterozygous, and 2 when the individual is homozygous ever, allele coding can be of vital importance in computa- 
for the second allele. Depending on which of the alleles tions. First, convergence of iterative methods such as 
has been chosen as the first allele, the coefficients are Markov chain Monte Carlo (McMC) methods often used 

in Bayesian inference can be assumed to be affected by the 
allele coding method used because different allele coding 

* Correspondence: ismo.stranden@mtt.fi methods change the correlation structure between marker 
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effects. Second, equivalent models have become popular in 
animal breeding [2,3]. An important concept in these 
methods is the genomic relationship matrix. Differences in 
allele coding will yield different genomic relationship 
matrices [4]. Thus, some elements of the inverse of 
the coefficient matrix can be different and, consequently, 
reliabilities may be different. 

We investigated effects of different allele coding meth- 
ods using theoretical derivations and a practical example 
when restricted maximum likelihood (REML) or Baye- 
sian inference is used. Effects on parameter estimates, 
reliabilities, and McMC computations were studied. We 
considered marker models and their equivalent breeding 
value models. 

Methods 

Genomic marker model 

Let us consider a univariate linear mixed effects model 
for genomic marker effect estimation, e.g. [1], 

y=l/x+Zg + e, (1) 

where y is n x 1 vector of observations, \a is the gen- 
eral mean, Z is a n x m matrix containing a column for 
each marker locus, g is a m x 1 vector of random SNP 
marker effects, and e is a random residual vector. There 
can be other fixed or random effects in the model but 
their inclusion does not change the following 
derivations. 

Allele coding methods 

There are several alternatives for coding the coefficients 
in the Z matrix. Four allele coding systems are consid- 
ered. A simple transformation can be made from one 
allele coding system to another. Our basic allele coding 
system counts the number of copies of one of the 
alleles. Depending on which of the alleles is counted, 
the matrix can be different. In the allele coding system 
012, the number of copies of the less frequent allele is 
counted. Thus, the coefficient is 0 if the individual is 
homozygous for the more frequent allele, 1 if it is het- 
erozygous, or 2 if it is homozygous for the less frequent 
allele. In this case, the Z matrix for the basic allele cod- 
ing system 012 is denoted by Z 0 . 

A general form for the allele coding transformation 
from the basic allele coding system is Zn — \ n ^ m where 
\ m is a m x 1 vector. This allows many types of allele 
coding methods. Note that the transformation keeps dis- 
tances between allele codes within a marker the same. 
So, the coding 0,1,2 can be changed to - 1,0,1 or 0.5, 
1.5, 2.5 by this transformation, but not to -10,0,10. 

We define the centered allele coding system as Z c = 
Z 0 - P c , where each column of the matrix P c contains 
the average allele count for the corresponding marker 



column. Thus, summing values in each column will give 
a vector of zeros, i.e., l'„(Zn — P c ) is a vector of zeros. 
For the centered allele coding system, we have 
v m = ±Z' 0 l n , i.e., Z C = Z 0 - £l„l'„Zo. Note that v m /2 
gives the allele frequencies of the markers in the data. 

The allele coding transformation allows shifts in the 
allele codes. The 101 allele coding system is such that - 
1 is assigned to the genotype homozygous for the more 
frequent allele, 0 to the heterozygous individual, and 1 
to the individual homozygous for the less frequent allele. 
For the 101 allele coding system, we have v m = l m . The 
101 allele coding system is equal to the centered allele 
coding system when all allele frequencies are equal to 
0.5. 

In the following, the derivations will use the general 
allele coding transformation. The matrix Z 0 is unique. 
However, in general the decision on which of the alleles 
to count is arbitrary. Let the 210 allele coding system be 
such that the more frequent allele is counted. Then the 
Z matrix for the 210 allele coding system can be calcu- 
lated from the 012 coding matrix by 21„l' m — Zn where 
1„ is n x 1 vector of ones. The 210 allele coding system 
is the opposite to the 012 allele coding system but 
results in this paper apply to the 210 allele coding sys- 
tem as well, with some modifications mentioned 
separately. 

Different allele coding methods imply different models 
(1) and different models may lead to different parameter 
estimates. However, the inference considered in this 
paper is not affected by allele coding, as we demonstrate 
below. 

Inference in marker-based model 

Variance component estimation by restricted maximum 
likelihood (REML), prediction of random effects, and 
Bayesian inference are all based on the likelihood after 
marginalization of the fixed effects, i.e., the fixed effects 
have been integrated out. Bayesian inference requires 
even more marginalization. In order to show that infer- 
ence is not affected by allele coding, it is sufficient to 
show that the likelihood after integrating out the general 
mean is the same irrespective of allele coding. The fol- 
lowing derivation makes no assumptions on the prior 
densities of marker effects. Thus, the results apply to 
many models, including BLUP, BayesA and BayesB in 
[1]. 

The marginal likelihood for the mixed effects model is 

oo 

P{j\g,0) = I p{y\n,g,8)dn, 

— oo 

where p(y \ fi, g, 0 ) is the conditional density of y, 
often a Gaussian density, and 6 contains all parameters 
in the distribution of e, often only the residual variance 
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parameter er 2 . Using the transformation result (7) in 
Appendix A and a change of integration variable 
fio = H- — vJng, we can write 

oo 

P(yl& 0) = f Po(ylM - V fflg , g , o) dn 

—oo 
oo 

= I Po(ylMo,g,<?)dMo 

—oo 

= po(ylg,#), 

where p 0 denotes the 012 allele coding system. Hence, 
the marginal likelihood does not depend on allele cod- 
ing, a property used in the following derivations. 

The REML-likelihood is defined when g and e are 
multivariate Gaussian distributed, and equals 

L(0,v) = lp{y\g,0)p{g\v)dg, 

where 7] contains all parameters in the distribution of g, 
commonly only the genetic marker variance parameter 
<r 2 . This likelihood is independent of allele coding, and, 
hence, REML parameter estimation is independent of 
allele coding. Note that maximum likelihood estimation 
is based on L(fi, 0, J]) = j p{y | fi, g, 0)p(g \T])dg and is 
affected by allele coding because, in this case, the general 
mean is not integrated out. 

BLUP estimation of marker effects g assumes that the 
variance parameters (0, 7]) are known. The conditional 
distribution p{g | y, 0, 71) = p{y | g, 0)p(g | 7])/L(0, Tj) is 
independent of allele coding. Hence, BLUP g and asso- 
ciated uncertainties do not depend on the allele coding. 

In Bayesian inference, the joint posterior after inte- 
grating out ft is 

P(&Mly) = fp{n,g,0,r)\y)d[i 

= p(yl&0)p(gWp(M)/p(y), 

where pifi, 7] ) is the joint prior for the parameters, 
and the denominator p(y) is the integral of the numera- 
tor. All terms in the numerator are independent of allele 
coding, and by marginalization p{y) satisfies the same. 
Hence, p(g, 0, 7] | y) does not depend on allele coding. 

The general intercept p is, however, not independent 
of allele coding. For simplicity of the argument, we 
assume that parameters (0, 7]) are known, and omit 
showing these values. According to the transformation 
result (8) in Appendix A and a change of integration 
variable /Ho = M — v^g, the conditional expectation of 
the general mean is 

A = f nfp{n,g\y)dgdfi 
= fffipo{fi - V m g, g\y)dfidg 

= ff{^o+ v^gJpoC^o, g\y)d[i 0 dg (2) 

= /MoPo(Moly)dMo + iV m gpo(gly)dg 
= Ao + v^g, 



where p 0 denotes density for the 012 allele coding, 
and Ao i s the conditional expectation of the general 
mean when using the 012 allele coding. Thus, the gen- 
eral mean estimate is different by allele coding when 
v^g is not zero. When g and e are multivariate Gaus- 
sian distributed, the conditional expectations g and A 
equal the BLUP and BLUE estimates, respectively. 

Finally, the inference is indifferent to the allele being 
counted. This is demonstrated by studying the centered 
coding system and assuming that allele in the first mar- 
ker is counted in the opposite way, i.e., the first column 
in Z is minus the first column in Z c , or z 1 = -z cl . We 
see that Zg = Z c g where the entries in g are equal to the 
entries in g, except for the first entry which equals 
minus the first entry in g. Since g and g have the same 
distribution, these two models are equivalent. 

Genomic breeding values 
Estimating breeding values 

In breeding value evaluation, the main interest is in esti- 
mation of genomic breeding values for the genotyped 
animals. In other words, estimation of a = Zg where g 
are solutions to the marker effects by a marker-based 
model like model (1). Because the marker effect solu- 
tions are the same for different allele coding systems, 
the estimated genomic breeding values are different due 
to differences in the coefficient matrix Z. 

Allele coding does not, however, change relative dif- 
ferences between the estimated genomic breeding 
values, because Zg — Zog = — lntv^g) shows that they 
are just shifted by a constant. Let us define complete 
genomic breeding values as a,j = 1„A + Zg. Substituting 
Z = Zo — and using equation (2) we obtain 

a,; = 1„A + (Z 0 - IrcvJJg 
= 1„(A- v^g) +Z 0 g 

= InAo +Z 0 g. 

Consequently, the estimated complete breeding values 
ad are the same irrespective of allele coding. 
Equivalent model and allele coding 

Assume that the marker effects have a Gaussian distri- 
bution g ~ N(0, I m cr 2 ) where \ m is an m x m identity 
matrix. The breeding values a = Zg can be calculated 
directly without estimating g by the model [2,3] 

y = l n n + a + e, 

where the breeding values have prior density of 
a ~ N(0, ZZ'Og 2 ). Often the covariance matrix of the 
breeding values is scaled by a value such as 

m 

Vp = 2 ^pt(l — pi) where p t is the allele frequency of 

i=i 

marker i. Then, the breeding values have a prior density 
of a ~ N(0, Gcr 2 ) where the genomic relationship matrix 
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is G = ZZ'Jj- and genetic variance is er 2 = a gVp. Assuming 
that the residual distribution is e ~ N (0, R), then the 
mixed model equations for the equivalent model are 



R 1 In 



I'nR 1 

R 1 + G 1 
l'„R 1 





A 




a 



(3) 



The breeding value solutions a from these mixed 
model equations (3) are the same as the genomic breed- 
ing values calculated by a = Zg where g are marker 
effects estimated by the marker-based model (1). 

Therefore, the conclusion for the marker-based models 
about relative differences between genomic breeding 
values being unaffected by allele coding is also true for the 
equivalent model, although different allele coding methods 
lead to different genomic relationship matrices. Similarly, 
variance component estimation by REML and Bayesian 
methods are unaffected by the allele coding due to equiva- 
lence of models with the marker-based models. 

The mixed model equations in (3) are not well-defined 
when the genomic relationship matrix G is singular. 
However, mixed model equations not requiring an 
invertible G matrix do exist; see page 48 in [5]. The 
genomic relationship matrix G can be singular for sev- 
eral reasons. For example, there can be identical twins 
or clones that have the same genotypes. In addition, for 
the centered allele coding system the genomic relation- 
ship matrix is G c = Z C Z[.. The last row of 
Z c = (I — il„l' n )Zo is equal to the sum of all the other 
rows. Hence, Z c is not of full rank, and G c is singular. 
Prediction error variances and reliabilities 
Gaussian models are often used in practical genomic 
evaluation of animals. In these models, reliabilities of 
estimated breeding values a are calculated using ele- 
ments of the inverse of the mixed model equations such 
as (3). Reliability of a, is 



1 



PEV 

ff a G 'i' 



(4) 



where PEV, is the prediction error variance, i.e., Var(a, | 
y), of animal i, and G,; is the diagonal element of animal i 
in the genomic relationship matrix G; e.g. [6], p. 51 in [7]. 
The prediction error variance for animal i is the diagonal 
element of the inverse of the coefficient matrix of mixed 
model equations (3) for animal i. Alternatively, 



PEV 



Var(a|y) 

Var(Zg|y) 

ZVar(g|y)Z' 

: ZC S Z' ( 



(5) 



where is the genomic marker effect submatrix in the 
inverse of the coefficient matrix of the mixed model 
equation for marker-based model (1) (see Appendix B). 
The submatrix = Var(g | y) is the same irrespective of 
the allele coding method used as shown in the chapter 
on inference on marker-based models. Because the coef- 
ficient matrix Z is different depending on allele coding, 
PEV is also different depending on allele coding. Conse- 
quently, the reliability of a depends on allele coding. 

More generally, for any of the models considered in 
this paper, a = Zg where p(g | y) is independent of allele 
coding and Z depends on allele coding. Therefore, the 
distribution p(a | y) and, in particular, the variance-cov- 
ariance matrix Var(a | y) and reliabilities of a depend 
on allele coding. 

The complete breeding value distribution p{&d | y) 
does not depend on allele coding, unlike PEV associated 
with a. The proof is based on the demonstration that 
when applying any function f, the expectations are inde- 
pendent of the allele coding system, 

E[f(a d )|y] 

= ffiUv- +Zg)p(M,gly) dfidg 
= //(l n (^-V m g) + Zog) 

xpo(^-V m g,g|y)^g 
= //(l«Mo + Z 0 g)pn(Mo, g|y) dfj, 0 dg 
-EoIfCaaM 

where E 0 is the expectation when using the basic allele 
coding method. Therefore, the variance-covariance 
matrix Var(a^ | y) and all higher order moments of the 
distribution are independent of allele coding. However, 
the result does not provide actual formulas for the 
moments. 

A closed form formula of the variance-covariance 
matrix is derived for a Gaussian model. Assume 
g~N(0,I m er g 2 ) and e ~ N (0, R). For this model, in 
Appendix B we obtain that 

Var(ad|y)=rl n l' n 

+(I„ - rln^R-^ZCXZ'tl,, - rR-'lnl'J, 

where r = 1/(1'„R _1 1„) and 

C« = [Z'(R- 1 -rR" 1 l„l' n R" 1 )Z + I m /a x 2 ]- 1 . As demon- 
strated earlier in this section, this variance-covariance 
matrix is independent of allele coding. When R = l n cr^, 
the variance-covariance matrix simplifies to 



Var(a d |y) 

= l„i;cr> + Z c (Z£ c /cr 2 + Ualr l Tl c , 



(6) 



where Z c is based on the centered allele coding 
method. The diagonal elements in (6) are different from 
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the PEV/S in (4) because they contain uncertainty about 
the unknown mean pi as well. 

For the complete breeding values a^, we have shown 
above that prediction error variances are independent of 
allele coding and we have provided a formula for the 
Gaussian model. Reliabilities of a^, however, can not be 
defined in a meaningful way. Substituting the diagonal 
elements from (6) for the PEV,s in (4) is not appropriate 
since the denominator in (4) is Var(a,) not Var(ad)«. 
The denominator in the reliability formula should con- 
tain the marginal (unconditional) variance Var(aj) 2I - = / 
Var(jM + &i | pi)dpi, but this variance is infinite. 

McMC computations 
Theoretical convergence rate 

The convergence and mixing of an McMC algorithm 
depend on the parametrization of the model and on the 
algorithm used. Theoretical results about the geometric 
rate of convergence to the stationary distribution for 
Gibbs sampling algorithms are shown in [8], and as men- 
tioned in [9] this rate also describes the mixing of the 
algorithm. Below we show specific results about the con- 
vergence rate p for Gibbs sampling algorithms for simu- 
lating from [pi, g | y] in the marker-based model (1), 
where g is Gaussian g ~ N(0, I m er 2 ) and e is Gaussian 
e ~ N(0, Incr 2 ). These results provide some ideas about 
more general models and algorithms where theoretical 
results cannot be obtained. 

Section 2.2 in [8] contains results about various 
Gibbs-sampling schemes for simulating from a multi- 
variate distribution. Here we apply these results (see 
Appendix C for details) to two types of Gibbs updating 
schemes. The first scheme iterates between updating pi 
and a block of all components in g, and will be called 
the block updating scheme hereinafter. The second 
scheme updates pi, g x , g m sequentially one at a time, 
and will be called the single site updating scheme. 

For the block scheme, the convergence rate is 

n _! 
px = — z C z, 

a} * 

where z = Z'l„/n is a m x 1 vector, and 
C g = Z'Z/cr 2 + I m /cr g 2 . 
For the single site scheme, the convergence rate is 

Pi =p/„(L s 1 (D- 1 zz'nK 2 -UJ, 

where L g is the matrix containing the lower triangle 
and the diagonal of D" 1 Cg, D is the diagonal of Cg, and 
U g is the matrix containing the upper triangle of D" Cg. 
This single site Gibbs sampling algorithm is the stochas- 
tic counterpart of the Gauss-Seidel algorithm for solving 
the mixed model equations, and the convergence results 
are similar, see [10]. 



When the centered allele coding method 
Z c = (I n — l„l' n /n) Zn is used, z is a vector of zeros 
and, hence, pi = 0. The centered allele coding method 
breaks dependency between the general mean and 
genetic marker effects, as seen from the variance-cov- 
ariance matrix (derived in Appendix B for a more gen- 
eral situation) 

Var(/i, g|y) 

_ 'af/n 0 

Consequently, absorption of the general mean is done 
without needing to compute absorption explicitly. Note 
that, in general, this holds only when the residual var- 
iance-covariance matrix is Ier 2 . For the block McMC 
scheme, the convergence and mixing of the algorithms 
are of the same order as for non-McMC algorithms that 
simulate directly from the distribution of interest. For 
the single site McMC scheme, the centered allele coding 
method still breaks the dependence between the general 
mean and marker effects, but as p 2 > 0 illustrates, the 
individual marker effects gx , g m are not independent 
and the McMC samples are autocorrelated. 

Data and methods 
Data 

Data for the XII^ QTLMAS workshop [11] were used to 
illustrate the theory. The simulated data had four gen- 
erations. In each generation, 15 sires and 150 dams 
were selected randomly to produce the next generation. 
Each sire was mated to 10 dams and each mating pro- 
duced 10 progeny. Thus, the base generation had 165 
individuals, and the subsequent three generations, 1500 
individuals each. In total, the analyzed data had 4665 
animals with phenotypes. The simulated trait had a her- 
itability of 0.30. The data had 6000 equally spaced SNP 
markers on six chromosomes. We deleted markers that 
had a minor allele frequency less than 1% among the 
phenotyped individuals and this reduced the number of 
markers to 5896. 

The 012 allele coding method was used to make the 
base data set. So, the least frequent allele was counted. 
In addition, 210, 101, and centered allele coding data 
sets were analyzed. 
Variance component analysis 

The marker-based model (1) with common genetic var- 
iance was used to analyze the data: e ~ N(0, Icr 2 ) and 
g ~ N(0, log 2 ). McMC computations by a single site 
updating Gibbs sampler were used to calculate posterior 
mean estimates of the location (pi, g) and dispersion 
[PgiOg) parameters. The length of the McMC chain was 
100000 iterations, of which the burn-in period of 10000 
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was omitted. Every tenth sample was saved, giving 9000 
saved samples. Effective sample sizes were calculated for 
all parameters using the initial monotone sequence 
approach [12]. The approach estimates the number of 
independent samples from the post burn-in samples. 

Theoretical convergence rates were calculated for all the 
allele coding methods in the single site and block updating 
schemes. Here, the convergence rate also describes the 
mixing of the McMC chain and was measured by the cor- 
relation between successive McMC samples. Because of 
the Markov property, the /c-lag correlation is p k , where p is 
the convergence rate and k is the lag or distance between 
samples. In order to compare the theoretical convergence 
rate to the observed effective sample size, theoretical mix- 
ing was calculated relative to the 012 allele coding system 
as follows. Let p 0 i2 be the convergence rate from the 012 
allele coding system, and p the convergence rates from 
another allele coding system. Mixing of the other allele 
coding system is equal to mixing of the 012 allele coding 
system when every k th sample is taken from the McMC 
samples and p 0 i2 = p ■ Thus, the relative mixing is k = log 
(poi 2 )/log(p). 

Parameter estimation by REML for both the marker- 
based model and the equivalent model and for all four 
allele coding methods was done using software DMU 
[13]. Both AI-REML and EM-REML were investigated. 
For the equivalent model and the centered allele coding 
method, the singular genomic relationship matrix was 
modified by multiplying the diagonals by 1.001 in order 
to be able invert the matrix. The effect of allele coding 
on convergence was investigated, and it was checked 
that the parameter estimates were the same. 
Reliabilities of genomic breeding values 
Reliabilities were calculated by different allele coding 
methods using (4) and elements of the inverse of the 
mixed model equations. The variance components were 
those estimated by the centered allele coding method 
using the single site McMC approach. Prediction error 
variances were calculated by both the marker-based 
model equations (5) and equivalent model equations (3). 
Mean, minimum, maximum and standard deviations of 
reliabilities were calculated for all allele coding methods. 
Also, correlations between reliabilities from all allele 
coding methods were calculated. 

Results and Discussion 

Posterior mean estimates of marker effects and variance 
components were almost equal between different allele 
coding methods (Table 1). Correlations of estimates of 
the marker effects between allele coding methods were 
higher than 99.98%. Only the general mean (^) had a 
different estimate, as expected. The estimated variance 
components agreed well with those used to simulate the 
data. Additive genetic variance was cr 2 = o^Vp = 1.556, 



Table 1 Posterior means of selected parameters by allele 
coding 



Allele coding 


Parameter 


012 


210 101 centered 


f 


1.698 


0.801 1.083 1.359 




6.700 X 10" 4 


6.703 X 10" 4 6.691 X 10 4 6.698 X 10~ 4 




2.996 


2.996 2.996 2.996 




m 




where Vp 


= 2£p,(i 


-pi) = 2323.00, and p t are the 




1=1 



observed allele frequencies in the reference data. Thus, 
the heritability estimate was h 2 = cr^/[&^ + aj) = 0.34> 
compared to the simulated value of 0.30. 

Effective sample sizes differed depending on allele 
coding method. The centered allele coding method had 
the best mixing, and the 210 allele coding method had 
the worst (Table 2). In particular, the increase in effec- 
tive sample size was largest for the general mean. For 
the centered allele coding method, the general mean 
was independent from the marker effect g, which led to 
excellent mixing of this parameter. In general, the mar- 
ker effects showed excellent mixing. With all allele cod- 
ing methods, effective sample sizes were at least 5500 
for all marker effects, and on average were equal to 
about 8800. 

Theoretical convergence rates (Table 3) displayed the 
same results as the effective sample sizes discussed 
above. Note that our Gibbs sampler used single site 
updates for all parameters. For the single site updating 
algorithm, the 210 allele coding system was predicted to 
need 5.64 times more iterations than the 012 allele cod- 
ing system. The number of effective samples for the 
general mean parameter (fj.) was 5.11 times bigger for 
the 012 than for the 210 allele coding system. These fig- 
ures were 0.48 and 0.48 for the 101 allele coding system, 
and 0.070 and 0.0051 for the centered allele coding sys- 
tem. Theoretical convergence rates for the block Gibbs 
sampler showed the same pattern as for the single site 
update (Table 3). 

Surprisingly, the block Gibbs sampler was predicted to 
be worse than the single site Gibbs sampler for all allele 
coding systems except for the centered allele coding sys- 
tem. However, it is well known in the literature that 
block-updating schemes may sometimes be worse than 



Table 2 Effective sample sizes in McMC computations by 
allele coding 



Allele coding 


Parameter 


012 


210 


101 


centered 


fJ 


46 


9 


96 


8961 


°l 


723 


330 


1001 


1701 


-I 


7814 


6861 


7661 


7720 
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Table 3 Predicted absolute and relative convergence 



rates in McMC computations by allele coding 



Allele coding 






012 


210 


101 


centered 


single site 


P2 


0.9974795 


0.9995523 


0.9947515 


0.9647670 




relative 


1.00 


5.64 


0.48 


0.070 


block 


Pi 


0.9995429 


0.9998860 


0.9989434 


0.00 




relative 


1.00 


4.01 


0.43 





single site updating schemes, for examples see [8]. The 
excellent convergence rate of the block Gibbs sampler 
with centered allele coding was expected because, in 
this case, the Gibbs sampler is equal to Monte Carlo 
sampling. 

Table 4 shows convergence of REML in parameter esti- 
mation. For the marker-based model, the convergence 
was independent of the allele coding system, whereas for 
the equivalent model, the convergence was fastest for the 
centered coding system and slowest for the 210 coding 
system, although the differences were small. The para- 
meter estimates obtained (Table 5) were the same, with 
the exception of cr, 2 for the centered coding system. The 
difference is due to the need to make the genomic rela- 
tionship matrix G c to be full rank by multiplication of the 
diagonals by 1.001. In summary, REML parameter esti- 
mation is only slightly affected by allele coding. 

Reliabilities were affected depending on the allele coding 
method used. Differences were large (Table 6). Average 
reliabilities ranged from 0.37 with the 210 allele coding 
method to 0.80 with the centered allele coding method. 
The centered allele coding method gave higher reliabilities 
than achieved by any of the other allele coding methods. 
Reliabilities calculated by different allele coding methods 
were also different as judged by the correlation to each 
other (Table 7). Reliabilities calculated by the marker- 
based model and the equivalent model approaches were 
equal within the numerical rounding error. 

The observed large differences in reliabilities using dif- 
ferent allele coding methods can be explained by differ- 
ences in estimation uncertainty. Different allele coding 



Table 4 Number of iterations in REML by allele coding 
and model 



Model 


Allele coding 


AI-REML 


EM-REML 


Marker 


012 


9 


45 


Marker 


210 


9 


-15 


Marker 


101 


9 


45 


Marker 


centered 


9 


45 


G 


012 


7 


34 


G 


210 


9 


44 


G 


101 


7 


31 


G 


centered 


7 


28 



systems have different Z matrices. Consider first the 012 
and 210 allele coding methods. The 012 allele coding sys- 
tem has a 0 coefficient when the individual is homozygous 
for the more frequent allele while the 210 allele coding 
system has a coefficient of 2 instead. In the marker-based 
model, uncertainty or the inverse of the coefficient matrix 
is the same irrespective of allele coding method. Reliability 
is calculated by multiplying the marker uncertainty by the 
Z matrix (5). Consequently, uncertainty is less in the 012 
allele coding system than in the 210 allele coding system 
because the more frequent homozygous allele multiplies 
the marker solution and uncertainty by zero. Thus, homo- 
zygous genotypes for the more frequent allele do not 
increase uncertainty when estimating genomic breeding 
values. Thus, the 012 allele coding system will yield higher 
reliabilities than the 210 allele coding system, as was 
observed. This argument can be generalized as follows. In 
the genomic model considered, uncertainty of a genotype 
in estimating genomic breeding value is valued relative to 
a chosen base genotype. The further away an observed 
genotype is from the base genotype, the larger the coeffi- 
cient in absolute value in the Z matrix and the higher the 
uncertainty in genomic breeding value. In the 012 allele 
coding system, the base genotype is homozygous for the 
more frequent allele, while in the 210 allele coding, it is 
homozygous for the less frequent allele. In the 101 allele 
coding system the base genotype is the heterozygote. The 
higher the number of heterozygous individuals is in the 
data the smaller will the uncertainty be, i.e., the higher the 
reliability will be for the 101 allele coding system. For the 
centered allele coding system the base genotype is the 
average genotype in the data. Thus, for this allele coding 
system the base population is roughly the population we 
work with [14], and it has the smallest average distance of 
observed genotypes from the base genotype. In practice, 
this can be expected to lead to the highest reliabilities. 

Different allele coding systems have different model 
design matrices Z, and, hence, imply different models. 
Thus, reliabilities from different allele coding systems are 
in fact from different statistical models. Comparison of 
reliabilities from different models is meaningless. How- 
ever, the different allele coding systems lead to the same 
parameter estimates. If the correct allele coding method, 
i.e., statistical model, is known, it should be used. Because 
the true model is unknown and comparison of reliabilities 
by allele coding method is meaningless, some principles 
must be used to decide on which allele coding method 
should be used. These principles will not guarantee the 
use of a correct model or correct reliabilities. One such 
principle should be consistency of reliabilities between 
evaluations. The centered allele coding method changes 
model from one evaluation to the next because more mar- 
ker data accumulate. Hence, according to the consistency 
principle, it cannot be recommended to computate 
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Table 5 REML estimates by allele coding 



Allele coding 


Model 


Parameter 


012 


210 


101 


centered 


Marker 




6.623 X 1 0~ 4 


6.623 X 1 0~ 4 


6.623 X 1 0~ 4 


6.623 X 1 0~ 4 


Marker 




2.993 


2.993 


2.993 


2.993 


G 




1.540 


1.540 


1.540 


1.540 


G 




2.993 


2.993 


2.993 


2.992 



reliabilities. Likewise, the base genotype in the 012 and 210 
allele coding methods depend also on the observed allele 
frequencies, i.e., marker data. 

The centered allele coding method is similar to that 
introduced in [4] where the allele frequencies were from 
an unselected base population. It was used in order to 
"give more credit to rare alleles than to common alleles 
when calculating genomic relationships". As shown, 
inference is the same irrespective of the allele coding 
method when a fixed general mean is in the model. 
However, reliabilities are affected as shown. The use of 
base population allele frequencies in the centered allele 
coding method would remove the above mentioned pro- 
blem of inconsistency between evaluations, but estimat- 
ing these allele frequencies is elusive. Recently, [15] 
presented a method for adjusting the G c relationship 
matrix to become a relationship matrix relative to the 
base population, thereby avoiding the estimation of base 
population allele frequencies. 

The results in this paper are based on the assumption 
that phenotypes and genotypes are available for all animals 
in the analysis. This assumption may often not be satisfied. 
Models based on an extension of the genomic relationship 
matrix to include also non-genotyped animals have been 
presented by [16-18]. The results in the present paper 
about parameter estimates and estimated breeding values 
not depending on allele coding do not carry over to the 
models with an extended genomic relationship matrix. 

Conclusions 

We showed that, in theory, different allele coding meth- 
ods led to the same inference in marker-based models 
when the model has a fixed general mean effect. Practi- 
cal analyses led to the same conclusions. Also in theory, 
the centered allele coding method was expected to give 



better mixing properties when Markov chain Monte 
Carlo methods were used. This was also observed in 
practice. When an equivalent breeding value model was 
used, different allele coding methods proved to lead to 
the same inference as in the marker-based model. How- 
ever, reliabilities of breeding values depend on the cho- 
sen allele coding system because different allele coding 
methods change the amount of uncertainty in the esti- 
mated breeding values. 

Appendix A 

In the following, we consider the effect of allele coding 
method on the densities p(y \ fi, g) and p(fi, g, | y). For 
simplicity of presentation, the parameters in the distri- 
bution of g and e are omitted. Let p 0 denote density for 
012 allele coding. Because the location parameters [i and 
marker effects g relate to the observations y only 
through \ n \A + Zg, we first study this term. By substitut- 
ing Z = Zrj — lnv'm into the term, we have 

1„U, + Zg = l n fl + Z 0 g - lWmS 
= 1„(m -V^g) +Z 0 g. 

So, when different allele coding systems are used, the 
densities have equality by 

P(ylM.g)=po(ylM-V m g,g). (7) 

By changing the integration variable Ho = M — v^g, we 
obtain j p (y | fi, g) dy, = / p 0 (y | fi 0 , g) dfi 0 and, hence, 
P(y) = 5 5 P(Y \ ft, g)p(g)d(idg = p 0 (Y)- From these 
results, we see that 

P0*/gly) =P(yl^g)P(g)/P(y) 

= Po(yl/x-v;„g / g)po(g)/Po(y) (8) 

= po(M-v' m g,g|y). 



Table 6 Summary statistics of genomic breeding value 

reliabilities by allele coding Table 7 Correlations between genomic breeding value 



Allele coding 


min 


mean 


max 


std 


reliabilities by allele coding 






012 


0.41 


0.49 


0.59 


0.022 


Allele coding 210 


101 


centered 


210 


0.30 


0.37 


0.42 


0.017 


012 -0.22 


0.32 


0.43 


101 


0.55 


0.62 


0.73 


0.024 


210 


0.82 


0.59 


centered 


0.72 


0.80 


0.95 


0.026 


101 




0.91 
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The results (7) and (8) are fairly general in terms of 
distributional assumptions. The only requirements are 
that p{y | fi, g) depends on fi and g only through l n fi + 
Zg, that an improper uniform prior is used for fi, and 
that p(y) is finite. The later requirement is to assure the 
posterior distribution becomes a proper distribution, 
and this has to be proven for a model to be valid when 
an improper prior is used. 

When e ~ MO, R), it is not difficult to show that 

/ p(y\fi, g) dn oc exp(- \ (y - Zg)'M(y - Zg)) 

with M = R 1 - VC x \ n V n VC x \{V n VC x \ n \ Therefore, J" 
P (y I (*> g) dfi < c o where the constant c 0 is independent 
of g. Thus, p(y) = J J p(y \ ft, g)dfip{g)dg < c 0 j p(g)dg = 
'"• finite, irrespective of distribution of the marker 



tg 
c 0 is 
effects g. 



Appendix B 

We consider a Gaussian distribution model where 
g ~ N(0, \ m a 2 ) and e ~ N (0, R). Consequently, the dis- 
tribution [ft, g | y] is a multivariate Gaussian distribu- 
tion. In the following, we derive the variance-covariance 
matrix for this distribution. The conditional density is 

p(/x,g|y) ocp(y|g,/x)p(g) 

ocexp(-i(y- l^-ZgJ'R-^y- l n H — Zg) 



ocexp [-j[fx-p. g'-g'JQ 



where 



l± — fi 



Q= ryar(/x,g|y)]- 



With r=l/{l' n R- 1 l n ), z r = Z'R 1 l„ and 
Cg = Z'R _1 Z+ l m a~ 2 . The matrix Q is the coefficient 
matrix in the mixed model equations and the inverse of 
this matrix is 



Var(/i, g|y) 



r + r 2 z' r C s z r 
-rC«Zr 



-T%<3~ 



where = (C g — rir^.) -1 . Note that the submatrix 
C s = Var(g | y) is independent of allele coding, because, 
as shown in the main text, p(g \ y) does not depend on 
allele coding. 

Variance-covariance matrix for the complete breeding 
value is 



r + r 2 z' r C s z r 


-rz' t O~ 




"l'„" 


-rC«z r 


a 




Z 



Var(a,j|y) 
= [1„Z] 

= rl n l' n + r 2 l„z;_C«z r l' n 

-rZC«z r l^ - rl„z'C«Z' + ZC S Z' 

+(I„ - rl^l^R^ 1 ) ZC«Z'(I„ - rR- 1 !,,^). 



Appendix C 

The results in [8] state that the convergence rate of a 
Gibbs sampler is equal to the largest modulus eigenva- 
lue of a certain matrix B where the eigenvalues can be 
complex numbers. As mentioned in [9] this convergence 
rate is also a a measure of correlation between succes- 
sive McMC samples, i.e., mixing of the algorithm. The 
closer the convergence rate is to zero the less correlated 
are the successive samples. 

The B matrix is constructed as follows. Let Q be the 
inverse of the variance-covariance matrix of the target 
multivariate normal distribution, in our case the coeffi- 
cient matrix in the mixed model equations. Assume that 
a Gibbs sampling scheme is used where the variables 
are grouped into s blocks and Q is split accordingly into 
5 blocks. First, define 



A = I 



diag(Qu\ 



Let L be the block lower triangular matrix with blocks 
in the lower diagonal being those of A, and let U = A - 
L. Thus, U is a strictly upper triangle matrix with zeros 
in the diagonal. Then the matrix of interest is 

B = (I-L^U. 

We consider the genomic marker model (1) where g is 
Gaussian g~N(0,I m cr 2 ) and e is Gaussian 
e ~ N{0,l n a 2 ). The conditional distribution [fi, g | y] is 
a multivariate normal distribution with some mean vec- 
tor and a variance-covariance matrix with inverse 



Q=[Var( M ,g|y)]- 



n/a e 2 l' n Z/a 2 
Z'l n /cr 2 C g 



where Cg = TlZja 2 + \ m \a 2 ; see Appendix B. We con- 
sider two McMC updating schemes. The first scheme 
iterates two blocks: fi and g. The second scheme 
updates successively all parameters: ft, gi, g m . For the 
first McMC updating scheme we have 
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A = I 



0 



0 

C7 1 



0 -z' 
-C^fnz/a 2 ) 0 

where z = Z'l„/n is a m x 1 vector. Hence, 



B 



-C7 1 (HZ/cr?) 0 



0 -z 
0 0 



1 



0 



0 -2. 

0 0 



-C^zn/a 2 I„ 
"0 -z' 

0 C^zz'n/cr 2 

The convergence rate is 

Pi = Pi„(B) 

= p !v (CJ 1 zz'n/cr 2 ) 

= A z'CT'z, 



where P; V (B) of a matrix B is a notation for the maxi- 
mum modulus eigenvalue of B. The final equality fol- 
lows from a general property for a square matrix form 
Cw' where v is a vector, saying that it only has one 
eigenvalue different from zero which is equal to v'Cv. 

For the second McMC update scheme, 



A = I 



i 



o 

D 



D-\nz/a e 2 ) D l C s 
0 -z' 
D-^nz/a 2 ) l m -D l C. 



where D is the diagonal of C g . Hence, 

-l 

1 U 

B = 

'D-^n/cr 2 L s 



where U g is an upper triangular matrix containing the 
upper triangle of D" 1 C g and L g is a matrix containing 
the diagonal and lower triangle of D" 1 C g . Therefore, 





"0 


-z' " 




0 





1 0 

B = 

"0 z' 

_0 L7 1 (D- 1 zz'n/a 2 -UJ 
The convergence rate is 



Pi = P/,(B) = p /v (L7 1 (D- 1 zz'n/cr 2 - U g )). 
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